{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0c031327-2456-41a2-b0ef-975bf96823c7",
   "metadata": {},
   "source": [
    "## How to add metadata to your documents and filter searches\n",
    "This notebook will walk you through how to upload metadata that provides extra information about the corpus you are ingesting with nv-ingest. It will show the requirements for the metadata file and what file types are supported. Then we will go throught he process of filtering searches, in this case, on the metadata we provided.\n",
    "\n",
    "First step is to provide imports for all the tools we will be using."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d32ff2e-ab3c-4118-9d74-ef3c63837003",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nv_ingest_client.client import Ingestor\n",
    "from nv_ingest_client.util.milvus import nvingest_retrieval\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e18ab4bf-6a00-4008-aa10-87741369fad1",
   "metadata": {},
   "source": [
    "Next we will annotate all the necessary variables to ensure our client connects to our pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a902bd2d-cf8e-4b68-8a98-a5b535e440d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name=\"nvidia/llama-3.2-nv-embedqa-1b-v2\"\n",
    "hostname=\"localhost\"\n",
    "collection_name = \"nv_ingest_collection\"\n",
    "sparse = True"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dde8506-44c7-4536-96c8-4cc1d273ba46",
   "metadata": {},
   "source": [
    "Now, we will begin by creating a dataframe with dummy metadata in it. The metadata can be ingested as either a dataframe or a file. Supported file types (json, csv, parquet). If you supply a file it will be converted into a pandas dataframe for you. In this example, after we create the dataframe, we write it to a file and we will use that file as part of the ingestion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f6b451d-40d8-46d8-88c5-aac7facd278d",
   "metadata": {},
   "outputs": [],
   "source": [
    "meta_df = pd.DataFrame(\n",
    "    {\n",
    "        \"source\": [\"data/woods_frost.pdf\", \"data/multimodal_test.pdf\"],\n",
    "        \"meta_a\": [\"alpha\", \"bravo\"]\n",
    "    }\n",
    ")\n",
    "file_path = \"./meta_df.csv\"\n",
    "meta_df.to_csv(file_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "157d8909-542b-47fd-b01c-6689eefdaf11",
   "metadata": {},
   "source": [
    "If you are supplying metadata during ingestion you are required to supply three keyword arguments.\n",
    "\n",
    "- meta_dataframe - This is either a string representing the file (to be loaded via pandas) or the already loaded dataframe.\n",
    "- meta_source_field - This is a string, that represents the field that will be used to connect to the document during ingestion.\n",
    "- meta_fields - This is a list of strings, representing the columns of data from the dataframe that will be used as metadata for the corresponding documents.\n",
    "\n",
    "All three of the parameters are required to enable metadata updates to the documents during ingestion.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6f9e2a4-7e50-491d-a0c6-21a4d4f27db9",
   "metadata": {},
   "outputs": [],
   "source": [
    "ingestor = ( \n",
    "    Ingestor(message_client_hostname=hostname)\n",
    "    .files([\"data/woods_frost.pdf\", \"data/multimodal_test.pdf\"])\n",
    "    .extract(\n",
    "        extract_text=True,\n",
    "        extract_tables=True,\n",
    "        extract_charts=True,\n",
    "        extract_images=True,\n",
    "        text_depth=\"page\"\n",
    "    ).embed(text=True, tables=True\n",
    "    ).vdb_upload(collection_name=collection_name, milvus_uri=f\"http://{hostname}:19530\", sparse=sparse, minio_endpoint=f\"{hostname}:9000\", dense_dim=2048\n",
    "                 ,meta_dataframe=file_path, meta_source_field=\"source\", meta_fields=[\"meta_a\"]\n",
    "                )\n",
    ")\n",
    "results = ingestor.ingest_async().result()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b722d073-5a87-4109-acc7-9c1d4399b625",
   "metadata": {},
   "source": [
    "Once the ingestion is complete, the documents will have uploaded to the vector database with the corresponding metadata as part of the `content_metadata` field. This is a json field that can be used as part of a filtered search. To use this, you can select a column from the meta_fields previously described and filter based on a value for that sub-field. That is what is done in this example below. There are more extensive filters that can be applied, please refer to https://milvus.io/docs/use-json-fields.md#Query-with-filter-expressions for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1396906e-321a-4ab6-af83-9e651a51cb7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "queries = [\"this is expensive\"]\n",
    "top_k = 5\n",
    "q_results = []\n",
    "for que in queries:\n",
    "    q_results.append(nvingest_retrieval([que], collection_name=collection_name, host=f\"http://{hostname}:19530\", embedding_endpoint=f\"http://{hostname}:8012/v1\",  hybrid=sparse, top_k=top_k, model_name=model_name, gpu_search=False\n",
    "                                            , _filter='content_metadata[\"meta_a\"] == \"alpha\"'\n",
    "                                           ))\n",
    "\n",
    "print(f\"{q_results}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
